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A path-based approach to random walks on networks 
characterizes how proteins evolve new function 
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We develop a path-based approach to continuous-time random walks on networks with arbitrarily 
weighted edges. We describe an efficient numerical algorithm for calculating statistical properties 
of the stochastic path ensemble. After demonstrating our approach on two reaction rate problems, 
we present a biophysical model that describes how proteins evolve new functions while maintain- 
ing thermodynamic stability. We use our methodology to characterize dynamics of evolutionary 
adaptation, reproducing several key features observed in directed evolution experiments. We find 
that proteins generally fall into two qualitatively different regimes of adaptation depending on their 
binding and folding energetics. 



Random walks on networks are ubiquitous across 
physics, chemistry, and biology, including molecular evo- 
lution [IH3], protein folding [4], chemical reactions [5], 
transport and search in complex media [6] [7], stochas- 
tic phenotypes [8], and cell-type differentiation [9HTT]. 
Each node on the network is assigned a value of the ob- 
jective function (for example, energy or fitness) which 
defines the rates of jumping to the neighboring nodes. 
Statistical properties of random walks determine quan- 
tities of interest such as mean first-passage times and 
path length distributions. Characterizing the diversity 
of stochastic paths is a central issue in evolutionary the- 
ory nHai na ng . 

Analytical treatments of random walks on networks 
tend to be limited to simple models with equally weighted 
edges [6j [I4j [15] , while direct simulations can be com- 
putationally expensive, especially when rare events are 
considered. In reaction rate theory, ensembles of stochas- 
tic trajectories may be built by transition path sampling 
[T6HT9] : however, this method involves considerable com- 
putational costs in complex systems. Another alterna- 
tive, transition path theory [4j[20][2l], requires a numer- 
ical solution of the backward equation. Neither approach 
directly addresses the diversity of stochastic paths. 

Here we develop a systematic and numerically effi- 
cient path-based approach to stochastic processes. Our 
method is applicable to semi-Markov jump processes (i.e., 
continuous-time random walks [22 ) on networks with ar- 
bitrary edge weights. The approach is well-suited for ob- 
taining statistics that describe the diversity of paths, such 
as the distribution of path lengths and path entropy. We 
use it to study adaptive dynamics of proteins evolving 
a new function while maintaining thermodynamic stabil- 
ity [12, 23-26 , a phenomenon of central interest in both 
natural and directed evolution (the latter aimed at engi- 
neering proteins with novel enzymatic activities [26, 27 ). 

A semi-Markov process on the state space S is de- 
fined by a set of jump probabilities, (cr / |Q|cr) for a — >> a' 
(a, a' E <S), and waiting time distributions ip a (t), where 



ip a (t) is the PDF of waiting exactly time t in state a 
before making a jump [22]. We assume that Vv(^) nas 
finite mean w(a) for all a E S. Note that S equipped 
with the jump matrix Q defines a network with directed, 
weighted edges. 



Define a path p as a sequence of states {cr , cri, . . . , cr^}. 
The time-independent probability of the system taking 
thepath(/?isll[(/?] = 7r(cr ) n^o^+ilQl "*)' where 7r(cr ) 
is the initial distribution. Let <£ be an ensemble of paths; 
for example, all first-passage paths from a set of ini- 
tial states Si to a set of final states Sf. The partition 
function for this ensemble is Z$ = XLe3>-^-M ( n °te 
that Z$ equals the normalization of the initial distri- 
bution 7r(<Jo) by probability conservation), and the en- 
tropy is S& = -Z~ x E^g$ n M log(II[<p]/£$). Let C[p\ 
be the length (number of jumps) of path p : and let 
7~M = Si=o w ( (J i) b e ^ ne average time of the path. We 
also define T a [p] = ^ZiZo^^^i w ( a i)^ the avera g e time 
the path spends in state a, and the indicator functional 
T a [<p>], which equals 1 if cp contains a and otherwise. 



The average time of paths in the ensemble is then given 
by r$ = (T)$ = Z' 1 E^g$ ^M n M- The average path 
length is £$> = (£)$, and the path length distribution is 
given by p 9 (t) = Z^J2 ve4> 5(£ - C\<p])TL\<p] [19 . Let 
£*$ be the standard deviation of path lengths, (T a )$ the 
spatial density of paths (the probability of paths in <£ 
hitting state a), and (7^)$/f$ the fraction of time spent 
in state a. 



Let \tt) be the vector of initial state probabilities and 
\a) be the vector with 1 at position a and otherwise. 
For each step £ and intermediate state a, we can recur- 
sively calculate Pe(cr) = (a\Q £ \7r) and Tg(cr), the total 
probability and average time of all paths that end at a 
in £ steps: 



P e (a') = J2 (*'\QW)Pe-i(v), (1) 

nn a of a' 

W)= J2 W'\QW)[T e -i(cr)+w(a)P f . 1 (a)}, 



nn a or a 



where Po(cr) = 7r(cr), Tq(ct) = 0, and the sums run over 
all nearest neighbors (nn) a of a' (V G 5/ are treated as 
absorbing states). Therefore ,£$ = Y^eLi ^Zaes ^M ") 
and 
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Other ensemble averages, such as 5$, (Z a )®, (Za^a')®, 
and (7^-)$, can be calculated similarly. Furthermore, we 
can calculate mean path divergence that characterizes 
the spatial diversity of the paths in <!> [13] : 
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d(a,a')P e (a)Pe(a'), 



(3) 



where d(a, a 7 ) is a distance metric on S. 

Our algorithm allows for very general definitions of 
the path ensemble <£ without having to explicitly enu- 
merate paths. For instance, $ can include paths that 
begin and end at arbitrary sets of states, or are disal- 
lowed from passing through arbitrary sets of interme- 
diate states. Restriction to first-passage paths is also 
straightforward. The time complexity of our algorithm 
is 0(jNL) (G(^N 2 L) for £>$), where 7 is the average 
number of nn, N is the number of states visited by paths 
in <l>, and L ~ £$> is the cutoff path length. For simple 
random walks, £$ ~ N dw ^ d f for d w > df and 1$ ~ N for 
d w < df, where d w is the dimension of the walk and df 
is the fractal dimension of the space [7J [15]. Therefore, 
the algorithm scales as 0{^N 1+dw ^ d f) for d w > df and 
G(jN 2 ) for d w < df, automatically accounting for the 
sparseness of network connections. 

To determine the cutoff path length L, we recall that 
p<&(£) ~ e -oti/i^ for sufficiently large £ : where a = 0(1) 
(Fig. WO) [15 J. Other path statistics, such as the average 
time f$(£) of paths up to length £ (Fig. SI A), also show 
exponential asymptotic behavior. Therefore in practice 
one need only consider paths with £ < L and infer the 
contributions of all longer paths from an exponential fit 
to the tail, which considerably improves the efficiency of 
the algorithm. This procedure takes advantage of the fact 
that information about longer paths is already contained 
in the structure of shorter paths. 

We now illustrate our approach on two reaction rate 
problems [28 . Consider two metastable states A and 
B with boundaries dA and dB. Let TP denote the en- 
semble of transition paths between A and B: these paths 



begin on either dA or dB and end on the opposite bound- 
ary without crossing any boundaries in between [TTl [18] . 
Similarly, RP denotes the ensemble of paths which return 
to the boundary on which they started. The initial states 
on dA and dB are weighted by the equilibrium distribu- 
tion 7r(<Jo) = e~^ v ^ a ^ jZ for a potential V(<Jo), inverse 
temperature j3 = 1/T, and state-space partition function 
z = Ea e<s e " mao) - B y definition, the first step of all 
TP and RP paths is from dA or dB to a point outside of 
A and B, and the waiting time on dA or dB is zero. 

Many TP statistics, such as the distribution of path 
lengths ptpC0> average time fxp, mean path diver- 
gence Ptp+rp, and the density of states p(a\TP) = 
(7^-)tp/^tp, can be calculated straightforwardly with our 
method (Figs. [I] SI, S2). We approximate the overall 
flux of TPs as the probability of being on a TP divided 
by the average time of a TP [18] : 



g(TP) = (1 - 7T A - 7T B )Z TP 
fxp Z T pTTP + ^RP^rp 



(4) 



where Z^p and Zrp are partition functions for transition 
and return paths, and tta and ttb are the equilibrium 
probabilities of A and B. The reaction rates are given 
by k A ^ B = A/(2tta) and k B ^A = A/(2tt jB ). 

First we consider a 2D double-well potential 
(Fig. [IR), where S is a square lattice with spacing 
Ax and jumps between nn have Monte Carlo rates 
(x', y'\W\x, y) = (Ax)' 2 min[l, e -«v(*V)-v(*,y))]. 
Mean waiting times are given by w(x,y) = 
(Enn {x>,v>) of (x,y)( x 'iy'\ W \ x iy))~ 1 i and jump probabil- 
ities are (x f ,j/\Q\x,y) =w(x,y)(x / ,y / \W\x,y). 

The density of states on transition paths p(x,y\TP) 
shows two symmetric channels by which most reactions 
between A and B occur (Fig. [lp). As noted above, the 
distribution of path lengths pxp {f) is exponential for long 
paths (Fig. [lp). In general, we expect paths to increase 
in length and diversity at higher temperatures. However, 
between /3 = 5 and j3 = 1 the paths become shorter and 
less diverse as T increases (Figs, [lp, SIB, S2). This is 
a signature of entropic switching [29] : at a critical value 
of /3, the two most energetically- favored pathways that 
dominated the low-T behavior become less favorable than 
the shorter path through the middle. Entropic switching 
is reflected in plots of the relative path divergence, fxp, 
£tp, and Stp (Figs, [lp, SIB), which readily generalize 
to arbitrary network spaces. 

We can also calculate the continuous-space limit of the 
TP flux A (Eq.pl and the reaction rates. We analytically 
continue A as a function of Ax: A (Ax) = Ao + X\Ax + 
0(Ax 2 ), where Ao is the continuous-limit flux and Ax 
should be smaller then the smallest length scale of the 
potential. Indeed, X(Ax) is linear (Fig. [lp, inset), yield- 
ing continuous-limit rates of &a^b — ks^A ~ 1.3xl0 -4 . 
Therefore, one need only calculate A at a few finite lat- 
tice spacings to infer continuous-limit rates. As shown 
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FIG. 1. (A) The double-well potential V(x,y) = \(A(l-x 2 - 
y 2f + 2( X * - 2 f + ((x + yf - l) 2 + ((x - yf - l) 2 - 2) M- 
The space S is a square lattice on [—1.6,1.6] x [—1.3,1.3] 
with spacing Ax. The metastable states are defined as 
A = [-1.5, -0.5] x [-0.5, 0.5] and B = [0.5, 1.5] x [-0.5, 0.5]. 
(B) Density of states on TPs p(x,y\TP) for the double-well 
potential. (C) Path length distribution ptp(^) (solid, blue) 
and exponential fit in the interval [L — 50, L] (dashed, green), 
where L = 750. In A,B,C, Ax = 0.05 and f3 = 10. Inset: TP 
flux A as a function of lattice spacing Ax. (D) The relative 
mean path divergence rj = (Ptp+rp(/?)/2>tp+rp(/? = 0)) 1/2 
and average time of TPs ttp versus f3. The divergence r/ is 
calculated using Eq.|3Jwith d(x, y; x', y') — (x — x / ) 2 -\-(y—y / ) 2 . 



in Fig. S3, our approach can be straightforwardly ex- 
tended to reactions on more complex structures such as 
fractals, which serve as models of transport in disordered 
media |B]. 

We now apply our methodology to study evolution of 
protein function; here, the function is defined as bind- 
ing a target such as an enzymatic substrate or another 
protein. Let Ef be the protein folding free energy (i.e., 
the free energy difference between its folded and unfolded 
states), and E^ the free energy of binding relative to the 
chemical potential of the target molecule. Then the pro- 
tein has the probability of folding l/(l-\-e^ Ef ) and, inde- 
pendently, the probability of binding l/{l-\-e^ Eb ), where 
f3 = 1.7 (kcal/mol) -1 is the inverse room temperature. 
We assume that the protein contributes fitness 1 if it both 
folds and binds, and /o < 1 otherwise [30 J. Then fitness 
averaged over all proteins in an organism is given by 



HEf,E b ) = 



(l + e^/Xl + e/ 3 ^) ' 



(5) 



The folding and binding energies are functions of the 
amino acid sequence a. We assume that the protein has 
a small number L of "hotspot" residues at the bind- 



ing interface [31 , and that each residue makes an ad- 
ditive contribution to the total energy [32]: Ef(a) = 
E °f + T,f=i e/(i,<7i)> E b (a) = El + Y^=i ^b(i^i), where 
EpE® are overall offsets and €f^(i,<Ji) is the energy of 
amino acid G{ at position i. The offset E^ is a fixed con- 
tribution to folding energy from the rest of the protein, 
which we assume to be perfectly adapted; e/'s are sam- 
pled from a Gaussian with mean 1.25 kcal/mol and stan- 
dard deviation 1.6 kcal/mol [33]. Since binding hot spots 
typically have a minimum penalty of 1-3 kcal/mol for mu- 
tations away from the wild-type amino acid [34], we set 
e5(i,cr^ est ) = V i (<j best is the best-binding sequence: 
Eb(a hest ) = £■£), and sample the rest of e^'s from an 
exponential distribution defined in the range of (l,oo) 
kcal/mol, with 2 kcal/mol mean [35]. Here we consider 
L = 5 hotspot residues and a reduced alphabet of 8 amino 
acids grouped by physico-chemical properties, resulting 
in 8 5 = 32768 unique sequences. The exact choices of 
these parameters have little effect on the overall qualita- 
tive features of the model. 

Our model naturally incorporates tradeoffs between 
function and stability [25] [26] [36] , even though binding 
and folding are uncorrelated [24]. Furthermore, our fit- 
ness landscape is nonlinear and thus epistatic: the fit- 
ness effect of a given mutation depends on the entire 
background sequence [IH3]. However, our landscape is 
correlated (k L sequences are determined by 2Lk ef^ pa- 
rameters) and thus differs from completely random land- 
scapes [10] in a manner consistent with experimental 

studies [icng. 

We sample one set of e/'s and two sets of e^'s for 
the old binding target and the new one. This proce- 
dure defines two fitness landscapes, T\ and T2 through 
Eq. pi (£"9 and E® are fixed). At first, each organism in 
the population has the sequence with maximum fitness 
under T\. The population then adapts to binding the 
new target on T<i- We assume the strong-selection limit: 
the population can only undergo substitutions that in- 
crease fitness. This limit implies that the population is 
monomorphic: mutations arise one at a time and either 
disappear or fix rapidly [37j [38] . Beneficial substitutions 
occur at a rate Nu <C 1, where N is the effective pop- 
ulation size and u is the mutation rate per amino acid. 
Assuming Markovian waiting times, the jump probabili- 
ties are (cr'|Q|cr) = l/b(a) if ^(a') > 7*(cr) and other- 
wise, where b(a) is the number of beneficial substitutions 
possible from a. Note that in this limit our results are 
independent of /o and Nu only affects the overall time 
scale. The path ensemble consists of all adaptive paths 
(APs). Figure ^1 shows two realizations of T<i with repre- 
sentative APs. 

Since our fitness landscapes (Eq. [5| are randomly gen- 
erated, we focus on their generic properties averaged over 
many realizations of ef and 65 (Figs. [31 S4). Our scans 
of the E^-E® parameter space reveal the existence of two 
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FIG. 2. Two realizations of the fitness landscape. (A) Bind- 
ing phase, with E® = — 17 kcal/mol and _E° = — 3 kcal/mol. 
(B) Folding phase, with E° f = -3 kcal/mol and E$ = 
-17 kcal/mol (E f (a hest ) = -9.4 kcal/mol). Representa- 
tive APs are shown in blue and green. Black star: sequence 
with global maximum on T\\ red diamonds: local maxima on 
Ti shaded according to their commitment probabilities (i.e., 
probabilities to be reached from the initial state); black cir- 
cles: intermediate states along APs, sized proportional to the 
density of APs (X ct )ap; small gray circles: states inaccessible 
to APs: black lines: contours of constant fitness. 



qualitatively different phases of adaptation. One, which 
we call the binding phase, is observed when E^ is low 
and E® is high (see Fig. pK for an example). In this 
case, the mean number of local fitness maxima is very low 
(Fig. [3K,B) and 5f, the average Hamming distance be- 
tween these maxima and the best-folding sequence (with 
the lowest Ef), is large (Fig. [3b). In contrast, £5, the 
average Hamming distance to the best-binding sequence, 
is close to zero. Thus in this phase the need to bind 
dominates adaptation. 

In the opposite limit (high E® and low ££; see Fig. [2b 
for an example), the folding phase is observed in which 
the mean number of local maxima is also low (Fig.[3]A,B) 
but these maxima are much closer to the best-folding 
rather than the best-binding sequence (Fig. [3^3). Here, 
the need to preserve protein stability dominates adap- 
tive dynamics. In the crossover regime between these two 
phases, there are many local maxima and therefore the 
most epistasis. This is reflected in the fact that the frac- 
tion of local maxima accessible from the initial state and 
the probability that the global maximum has the largest 
commitment probability are lower, while the fraction of 
sequence space accessible to APs is higher in this regime 
compared to the binding and folding phases (Fig. [3b). In 
the crossover regime, the tradeoff between binding and 
folding alone can result in proteins with marginal fold- 
ing stability, in contrast with previous hypotheses that 
explain marginal stability with mutational entropy [23] 
or a fitness function that disfavors hyperstable proteins 

On average, paths in the binding phase are longer than 
those in the folding phase, and adaptation takes more 
time (Figs. [3t), S4A). Paths in this regime have higher 
entropy, indicating that adaptation involves a diverse set 
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FIG. 3. (A) Average number of local fitness maxima as a 
function of the energy offsets E® and E®. (B) Average num- 
ber of local fitness maxima (solid, green), the average dis- 
tance Sf between the maxima and the best-folding sequence 
(dashed, blue), and the average distance 5b between the max- 
ima and the best-binding sequence (dotted, red) for the pa- 
rameter subspace E® + E® = —20 kcal/mol. Note that the 
distance between two random sequences is 1 — l//c, where 
k is the size of the alphabet. (C) Fraction of local fitness 
maxima accessible from the initial state (dashed, blue), prob- 
ability that the global maximum has the largest commitment 
probability among all local maxima (dotted, red), probabil- 
ity the initial sequence starts at a local maximum resulting 
in no adaptation (dashed and dotted, cyan), and fraction of 
state space accessible to APs (solid, green). (D) Mean length 
Zap , standard deviation ^p •> an d entropy Sap • All quantities 
are per-residue. The probability of no adaptation in (C) is 
an average over 2 x 10 4 landscape realizations; all other data 
points are averages over 5 x 10 3 realizations, and realizations 
with no adaptation are excluded. 



of intermediate sequences rather than a few dominant 
pathways. The standard deviation of path lengths is also 
higher (Fig. |3t)). In the folding phase APs tend to be 
short since the initial sequence is often either close to, or 
already at a local maximum (Figs. [3fc, S4B). A similar 
situation is observed in directed evolution experiments 
where the initial sequence already has some affinity for 
the new ligand but cannot increase it any further [I2j [40] . 
In such cases, the sequences must first be mutated away 
from the local maximum. Furthermore, in the folding 
phase folding energy tends to increase at the beginning 
of paths and decrease toward the end, as a consequence 
of the distribution of sequences in energy space relative 
to the fitness contours (Fig. ^p). This is consistent with 
experiments in which folding stability is sacrificed first 
and recovered later en route to the new function [26] . 

Our model can be extended to account for binding- 
mediated stability, in which binding stabilizes an other- 



wise disordered protein [41]. We can also incorporate 
chaperone-assisted folding [42] by modifying E^ or the 
distribution of e/'s. Furthermore, we can include "folding 
hotspots" away from the binding interface to see if they 
acquire stabilizing mutations as a buffer against destabi- 
lizing but function- improving mutations at the interface 
[25] [26] . The role of neutral and slightly deleterious mu- 
tations can be studied as well by using substitution rates 
from more complex population genetics models [371 EH] ? 
although we expect non-adaptive substitutions to play 
little role on short time scales. We look forward to study- 
ing these extensions in future work. 

A.V.M. acknowledges support from National Institutes 
of Health (R01 HG004708) and an Alfred P. Sloan Re- 
search Fellowship. 



[i 

[2 
[3 

[4: 

[5; 

[e; 

[7; 

[8 
[9 

[10 

[11 
[12; 
[13 

[14 

[is: 

[16 
[17 

[is; 

[19 



morozov@physics . r ut gers . edu 

D. M. Weinreich, N. F. Delaney, M. A. DePristo, and 

D. L. Hartl, Science 312, 111 (2006). 

F. J. Poelwijk, D. J. Kiviet, D. M. Weinreich, and S. J. 

Tans, Nature 445, 383 (2007). 

M. Carneiro and D. L. Hartl, Proc. Natl. Acad. Sci. USA 

107, 1747 (2010). 

F. Noe, C. Schiitte, E. Vanden-Eijnden, L. Reich, and 
T. R. Weikl, Proc. Natl. Acad. Sci. USA 106, 19011 
(2009). 

P. G. Bolhuis, D. Chandler, C. Dellago, and P. L. 
Geissler, Ann. Rev. Phys. Chem. 53, 291 (2002). 
D. ben Avraham and S. Havlin, Diffusion and Reactions 
in Fractals and Disordered Systems (Cambridge, Cam- 
bridge, 2000). 

S. Condamin, O. Benichou, V. Tejedor, R. Voituriez, and 
J. Klafter, Nature 450, 77 (2007). 

D. M. Roma, R. A. O'Flanagan, A. E. Ruckenstein, and 
A. M. Sengupta, Phys. Rev. E 71, 011902 (2005). 

C. H. Waddington, The Strategy of the Genes. A Discus- 
sion of Some Aspects of Theoretical Biology (Allen and 
Unwin, London, 1957). 

S. Kauffman, The Origins of Order: Self- Organization 
and Selection in Evolution (Oxford University Press, 
New York, 1993). 

T. Enver, M. Pera, C. Peterson, and P. W. Andrews, 
Cell Stem Cell 4, 387 (2009). 

J. T. Bridgham, E. A. Ortlund, and J. W. Thornton, 
Nature 461, 515 (2009). 

A. E. Lobkovsky, Y. I. Wolf, and E. V. Koonin, PLoS 
Comput. Biol. 7, el002302 (2011). 

J. D. Noh and H. Rieger, Phys. Rev. Lett. 92, 118701 
(2004). 

E. M. Bollt and D. ben Avraham, N. J. Phys. 7, 26 
(2005). 

C. Dellago, P. G. Bolhuis, F. S. Csajka, and D. Chandler, 
J. Chem. Phys. 108, 1964 (1998). 

C. Dellago, P. G. Bolhuis, and P. L. Geissler, Adv. Chem. 
Phys. 123, 1 (2003). 

G. Hummer, J. Chem. Phys. 120, 516 (2004). 

B. Harland and S. X. Sun, J. Chem. Phys. 127, 104103 
(2007). 



[20 
[21 
[22 
[23 
[24 
[25 
[26 
[27 

[28 

[29 

[30 

[31 
[32; 

[33 

[34 
[35; 

[36 

[37 

[38 
[39 

[40; 

[41 
[42; 



W. E and E. Vanden-Eijnden, J. Stat. Phys. 123, 503 
(2006). 

P. Metzner, C. Schiitte, and E. Vanden-Eijnden, Multi- 
scale Model. Simul. 7, 1192 (2009). 

G. H. Weiss, Aspects and Applications of the Random 
Walk (North Holland, Amsterdam, 1994). 
K. B. Zeldovich, P. Chen, and E. I. Shakhnovich, Proc. 
Natl. Acad. Sci. USA 104, 16152 (2007). 
N. Tokuriki, F. Stricher, L. Serrano, and D. S. Tawfik, 
PLoS Comput. Biol. 4, el000002 (2008). 
N. Tokuriki and D. S. Tawfik, Curr. Opin. Struct. Biol. 
19, 596 (2009). 

J. D. Bloom and F. H. Arnold, Proc. Natl. Acad. Sci. 
USA 106, 9995 (2009). 

T. A. Whitehead, A. Chevalier, Y. Song, C. Dreyfus, S. J. 
Fleishman, C. De Mattos, C. A. Myers, H. Kamisetty, 
P. Blair, I. A. Wilson, and D. Baker, Nature Biotech. 
30, 543 (2012). 

P. Hanggi, P. Talkner, and M. Borkovec, Rev. Mod. 
Phys. 62, 251 (1990). 

P. Metzner, C. Schiitte, and E. Vanden-Eijnden, J. 
Chem. Phys. 125, 084110 (2006). 

S. Mayer, S. Riidiger, H. C. Ang, A. C. Joerger, and 
A. R. Fersht, J. Mol. Biol. 372, 268 (2007). 
T. Clackson and J. A. Wells, Science 267, 383 (1995). 
L. Serrano, A. G. Day, and A. R. Fersht, J. Mol. Biol. 
233, 305 (1993). 

N. Tokuriki, F. Stricher, J. Schymkowitz, L. Serrano, and 
D. S. Tawfik, J. Mol. Biol. 369, 1318 (2007). 
A. A. Bogan and K. S. Thorn, J. Mol. Biol. 280, 1 (1998). 
K. S. Thorn and A. A. Bogan, Bioinformatics 17, 284 
(2001). 

X. Wang, G. Minasov, and B. K. Shoichet, J. Mol. Biol. 
320, 85 (2002). 

J. F. Crow and M. Kimura, An Introduction to Pop- 
ulation Genetics Theory (Harper and Row, New York, 
1970). 

M. Manhart, A. Haldane, and A. V. Morozov, Theor. 
Pop. Biol. 82, 66 (2012). 

M. A. DePristo, D. M. Weinreich, and D. L. Hartl, Nat. 
Rev. Genet. 6, 678 (2005). 

S. Bershtein, K. Goldin, and D. S. Tawfik, J. Mol. Biol. 
379, 1029 (2008). 

C. J. Brown, A. K. Johnson, A. K. Dunker, and G. W. 
Daughdrill, Curr. Opin. Struct. Biol. 21, 441 (2011). 
S. L. Rutherford, Nat. Rev. Genet. 4, 263 (2003). 



Supplementary Material: 

A path-based approach to random walks on networks 

characterizes how proteins evolve new function 

Michael Manhart 1 and Alexandre V. Morozov 1 ' 2 

1 Department of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854 
2 BioMaPS Institute for Quantitative Biology, Rutgers University, Piscataway, NJ 08854 



T T M 




'TP 



500 1000 1500 2000 2500 




FIG. SI. (A) For the ensemble of transition paths (TPs) in the 2D double- well potential, the mean time ttp(£) of paths up to 
length £. In the limit £ — > oo, ttp(£) converges to the total mean time fxp. Similar to ptp(£) in Fig. 1C, for sufficiently large £ 
ttp(£) acquires a universal exponential form: (ttp — ttp{£)) ~ e _a ^ TP . We choose L — 750 as the effective cutoff used to fit 
an exponential tail of ttp(^); fit in the range £ 6 [L — 50, L] (dashed, green) closely matches the full calculation (solid, blue) for 
£ > L. (B) The mean length £tp (dashed, blue) and standard deviation ^ P (dotted, red) as functions of inverse temperature 
(3. Both quantities have units of area since they are rescaled by (Ax) 2 . Also shown is the entropy Stp (solid, green) of TPs. 
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FIG. S2. For the 2D double-well potential, the equilibrium distribution of states 7r(x,y), density of states on TPs p(x,y\TP), 



and TP densities (given that the system is at (x,y), the probability it is on a TP: p(TP\x, y) 
£Rp(Z(x,y))Rp)) at different values of f3. All calculations use Ax = 0.05. 
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FIG. S3. Sierpinski triangle embedded in the triple- well potential V(x,y) = 10£? =1 ((a! - x;) 2 + (y - yi) 2 )e~ Hx - x ^ - 5 (v-Wi) , 
where (#1,2/1) — (0, l/\/3), (2:2,2/2) = (1/2, — 1/(2<\/3)), and (2:3,2/3) = (—1/2,-1/(2^)). There are three metastable states 
outlined in black, one at each corner of the triangle. Monte Carlo jump rates are rescaled by (Ax) d ™, where Ax = 2~ n (n 
is the fractal order) and d w — log 5/ log 2 is the dimension of a random walk on the Sierpinski triangle. (A) Transition path 
flux A as a function of lattice spacing Ax. As with the double-well potential, analytic continuation of A(Ax) allows us to infer 
the reaction rate k ~ A/ 2 w 2.0 x 10 -2 between any pair of metastable states in an infinite-order fractal from a few finite 
realizations. (B) The potential V(x,y), the density of states on TPs p(x,y\TP), and TP densities p(TP\x,y) for /3 = 6. 
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FIG. S4. (A) Mean time tap (in units of (A^) _1 ) of APs for the subspace E° f + E% = -20 kcal/mol. (B) Plot of average 
path length Iap (dashed, blue), standard deviation of path length ^^ P (dotted, red), maximum possible path length (dashed 
and dotted, cyan), and the average net distance 5 between the initial state and final state (solid, green) for the parameter 
subspace E® + E® = —20 kcal/mol. On average, proteins will undergo twice as many substitutions as the average distance 5. 
The maximum number of substitutions is at least 3£. The crossover regime allows for the greatest variation in path lengths, 
both in terms of standard deviation and maximum possible length. All quantities are per-residue, and data points are averages 
from 5 x 10 3 landscape realizations. 



